{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Uj7R48-ZXmTC"
      },
      "source": [
        "# Retrieval Augmented Generation (Using Open Source Vector Store) - Procurement Contract Analyst - Palm2 & LangChain\n",
        "\n",
        "<table align=\"left\">\n",
        "\n",
        "  <td>\n",
        "    <a href=\"https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\">\n",
        "      <img src=\"https://cloud.google.com/ml-engine/images/colab-logo-32px.png\" alt=\"Colab logo\"> Run in Colab\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\">\n",
        "      <img width=\"32px\" src=\"https://raw.githubusercontent.com/primer/octicons/refs/heads/main/icons/mark-github-24.svg\" alt=\"GitHub logo\">\n",
        "      View on GitHub\n",
        "    </a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\">\n",
        "      <img src=\"https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32\" alt=\"Vertex AI logo\">\n",
        "      Open in Vertex AI Workbench\n",
        "    </a>\n",
        "  </td>\n",
        "</table>\n",
        "\n",
        "<div style=\"clear: both;\"></div>\n",
        "\n",
        "<b>Share to:</b>\n",
        "\n",
        "<a href=\"https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg\" alt=\"LinkedIn logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg\" alt=\"Bluesky logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg\" alt=\"X logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png\" alt=\"Reddit logo\">\n",
        "</a>\n",
        "\n",
        "<a href=\"https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/search/retrieval-augmented-generation/examples/contract_analysis.ipynb\" target=\"_blank\">\n",
        "  <img width=\"20px\" src=\"https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg\" alt=\"Facebook logo\">\n",
        "</a>            "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d82356437c6e"
      },
      "source": [
        "| | |\n",
        "|-|-|\n",
        "|Author(s) | [Guru Rangavittal](https://github.com/guruvittal) |"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SPG8eRQCcNQ3"
      },
      "source": [
        "## Installation & Authentication\n",
        "\n",
        "Install LangChain, Vertex AI LLM SDK, ChromaDB, and related libraries.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IdsE3DEJcM35"
      },
      "outputs": [],
      "source": [
        "%pip install -q google-cloud-aiplatform==1.36.0 langchain==0.0.327 unstructured chromadb==0.4.15 --upgrade --user"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R5Xep4W9lq-Z"
      },
      "source": [
        "### Restart current runtime\n",
        "\n",
        "To use the newly installed packages in this Jupyter runtime, you must restart the runtime. You can do this by running the cell below, which will restart the current kernel."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "id": "XRvKdaPDTznN"
      },
      "outputs": [],
      "source": [
        "# Restart kernel after installs so that your environment can access the new packages\n",
        "\n",
        "import IPython\n",
        "\n",
        "app = IPython.Application.instance()\n",
        "app.kernel.do_shutdown(True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SbmM4z7FOBpM"
      },
      "source": [
        "<div class=\"alert alert-block alert-warning\">\n",
        "<b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. ⚠️</b>\n",
        "</div>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ptwMB9pqcniz"
      },
      "source": [
        "**Authenticate**\n",
        "Within the Colab a simple user authentication is adequate.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "id": "HP80SWi0rIBL"
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "if \"google.colab\" in sys.modules:\n",
        "    from google.colab import auth as google_auth\n",
        "\n",
        "    google_auth.authenticate_user()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8a1cc7ca1e0d"
      },
      "source": [
        "## Get Libraries & Classes\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4lcArn48r6pZ"
      },
      "source": [
        "**Reference Libraries**\n",
        "\n",
        "In this section, we will identify all the library classes that will be referenced in the code.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "ax6hlCt7YbXx"
      },
      "outputs": [],
      "source": [
        "from google.cloud import aiplatform\n",
        "from langchain.document_loaders import GCSDirectoryLoader\n",
        "from langchain.embeddings import VertexAIEmbeddings\n",
        "from langchain.llms import VertexAI\n",
        "\n",
        "# Chroma DB as Vector Store Database\n",
        "from langchain.vectorstores import Chroma\n",
        "\n",
        "# Using Vertex AI\n",
        "import vertexai\n",
        "\n",
        "print(f\"Vertex AI SDK version: {aiplatform.__version__}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7yHf4ipxYyfy"
      },
      "source": [
        "## Initialize Vertex AI\n",
        "\n",
        "**We will need a project id and location where the Vertex AI compute and embedding will be hosted**\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 10,
      "metadata": {
        "id": "EZe8iS2CY2E8"
      },
      "outputs": [],
      "source": [
        "PROJECT_ID = \"PROJECT_ID\"  # @param {type:\"string\"}\n",
        "\n",
        "LOCATION = \"us-central1\"  # @param {type:\"string\"}\n",
        "\n",
        "# Initialize Vertex AI SDK\n",
        "vertexai.init(project=PROJECT_ID, location=LOCATION)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7YozENqoAmm9"
      },
      "source": [
        "## Ingest the Contracts to build the context for the LLM\n",
        "\n",
        "_Load all the Procurement Contract Documents_\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 11,
      "metadata": {
        "id": "Td4rD2MQtM1O"
      },
      "outputs": [],
      "source": [
        "loader = GCSDirectoryLoader(\n",
        "    project_name=PROJECT_ID, bucket=\"contractunderstandingatticusdataset\"\n",
        ")\n",
        "documents = loader.load()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5Da7_1bpFGpb"
      },
      "source": [
        "_Split documents into chunks as needed by the token limit of the LLM and let there be an overlap between the chunks_\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 12,
      "metadata": {
        "id": "2E_qzSuMFHKt"
      },
      "outputs": [],
      "source": [
        "# split the documents into chunks\n",
        "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
        "\n",
        "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
        "docs = text_splitter.split_documents(documents)\n",
        "print(f\"# of documents = {len(docs)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f0htdnYAHonv"
      },
      "source": [
        "## Structuring the ingested documents in a vector space using a Vector Database\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TuwJlBNwIy0y"
      },
      "source": [
        "_Create an embedding vector engine for all the text in the contract documents that have been ingested_\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "_nNWJ6XaH1fa"
      },
      "outputs": [],
      "source": [
        "# Define Text Embeddings model\n",
        "embedding = VertexAIEmbeddings()\n",
        "\n",
        "\n",
        "embedding"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_A6JEaReI96b"
      },
      "source": [
        "_Create a vector store and store the embeddings in the vector store_\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ITiqOdEDJGOn"
      },
      "outputs": [],
      "source": [
        "contracts_vector_db = Chroma.from_documents(docs, embedding)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oVCftNbU-wp0"
      },
      "source": [
        "## Obtain handle to the retriever\n",
        "\n",
        "We will use the native retriever provided by Chroma DB to perform similarity search within the contracts document vector store among the different document chunks so as to return that document chunk which has the lowest vectoral \"distance\" with the incoming user query.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "Xb2VCn6e-0zp"
      },
      "outputs": [],
      "source": [
        "# Expose index to the retriever\n",
        "retriever = contracts_vector_db.as_retriever(\n",
        "    search_type=\"similarity\", search_kwargs={\"k\": 2}\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QcwQNvfN_6Rn"
      },
      "source": [
        "## Define a Retrieval QA Chain to use retriever\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "id": "ElWUO3fQAMaH"
      },
      "outputs": [],
      "source": [
        "# Create chain to answer questions\n",
        "from langchain.chains import RetrievalQA\n",
        "\n",
        "llm = VertexAI(\n",
        "    model_name=\"gemini-2.0-flash\",\n",
        "    max_output_tokens=256,\n",
        "    temperature=0.1,\n",
        "    top_p=0.8,\n",
        "    top_k=40,\n",
        "    verbose=True,\n",
        ")\n",
        "\n",
        "# Uses LLM to synthesize results from the search index.\n",
        "# We use Vertex AI PaLM Text API for LLM\n",
        "qa = RetrievalQA.from_chain_type(\n",
        "    llm=llm, chain_type=\"stuff\", retriever=retriever, return_source_documents=True\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BHYlgYQhFTQW"
      },
      "source": [
        "## Leverage LLM to search from retriever\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tUZDkQOrGb9X"
      },
      "source": [
        "_Example:_\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "id": "JGTS8x7TFoOn"
      },
      "outputs": [],
      "source": [
        "query = \"Who all entered into agreement with Sagebrush?\"\n",
        "result = qa({\"query\": query})\n",
        "print(result)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FgYa4k1WiolM"
      },
      "source": [
        "## Build a Front End\n",
        "\n",
        "Enable a simple front end so users can query against contract documents and obtain intelligent answers with grounding information that references the base documents that was used to respond to user query\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4a84715c549d"
      },
      "outputs": [],
      "source": [
        "%pip install -q gradio"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CRjBQsXqirbC"
      },
      "outputs": [],
      "source": [
        "from google.cloud import storage\n",
        "import gradio as gr\n",
        "\n",
        "\n",
        "def chatbot(input_text):\n",
        "    result = qa({\"query\": input_text})\n",
        "\n",
        "    return (\n",
        "        result[\"result\"],\n",
        "        get_public_url(result[\"source_documents\"][0].metadata[\"source\"]),\n",
        "        result[\"source_documents\"][0].metadata[\"source\"],\n",
        "    )\n",
        "\n",
        "\n",
        "def get_public_url(uri):\n",
        "    \"\"\"Returns the public URL for a file in Google Cloud Storage.\"\"\"\n",
        "    # Split the URI into its components\n",
        "    components = uri.split(\"/\")\n",
        "\n",
        "    # Get the bucket name\n",
        "    bucket_name = components[2]\n",
        "\n",
        "    # Get the file name\n",
        "    file_name = components[3]\n",
        "\n",
        "    storage_client = storage.Client()\n",
        "    bucket = storage_client.bucket(bucket_name)\n",
        "    blob = bucket.blob(file_name)\n",
        "    return blob.public_url\n",
        "\n",
        "\n",
        "print(\"Launching Gradio\")\n",
        "\n",
        "iface = gr.Interface(\n",
        "    fn=chatbot,\n",
        "    inputs=[gr.Textbox(label=\"Query\")],\n",
        "    examples=[\n",
        "        \"Who are parties to ADMA agreement\",\n",
        "        \"What is the agreement between MICOA & Stratton Cheeseman\",\n",
        "        \"What is the commission % that Stratton Cheeseman will get from MICOA and how much will they get if MICOA's revenues are $100\",\n",
        "    ],\n",
        "    title=\"Contract Analyst\",\n",
        "    outputs=[\n",
        "        gr.Textbox(label=\"Response\"),\n",
        "        gr.Textbox(label=\"URL\"),\n",
        "        gr.Textbox(label=\"Cloud Storage URI\"),\n",
        "    ],\n",
        "    theme=gr.themes.Soft,\n",
        ")\n",
        "\n",
        "iface.launch(share=False)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "name": "contract_analysis.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
